Introduction
Today's fast-changing retail industry expects retailers to know their customers' shopping behavior beforehand. Sales optimization requires accurate prediction of customers' shopping habits and stocking of inventory in advance. Failing to do so impedes the customers' shopping experience and shrinks the customer base. Today, let us explore the Groceries dataset in R to find associations between the 20 most frequently sold items and the complementary products bought with them. We will use these associations to suggest that retailers place such products in adjacent aisles for a smooth shopping experience and sales optimization.
1. Apriori Algorithm
The Apriori algorithm, as the name suggests, uses prior knowledge of frequent itemset properties to find relations between items. It applies an iterative approach known as level-wise search, in which frequent k-itemsets are used to explore (k+1)-itemsets. The algorithm uses the Apriori property to improve the efficiency of level-wise generation of frequent itemsets by reducing the search space. The Apriori property states that all non-empty subsets of a frequent itemset must also be frequent. Let us take an example to further understand the algorithm.
The Cake dataset below consists of a few imaginary items purchased from a retail store:
The Association Rules:
The dataset helps us construct a set of rules as follows:
Rule 1: If Flour is purchased, then Egg is also purchased.
Rule 2: If Egg is purchased, then Flour is also purchased.
Rule 3: If Flour and Egg are purchased, then Sugar is also purchased in 60% of the transactions.
The rules above state:
1. Whenever Flour is purchased, Egg is also purchased, and vice versa.
2. If Flour and Egg are purchased, then Sugar is also purchased. This is true in 3 out of the 5 transactions.
If {Flour} and {Sugar} are both one-item sets, a new set {Flour, Sugar} can be created from them. The new set is used to identify the products purchased when both Flour and Sugar are purchased. To form an association between itemsets, the items of a transaction are split into a Left Hand Side (LHS) and a Right Hand Side (RHS). Every purchase of {Flour} along with {Sugar} is represented as {Sugar} => {Flour}. Here {Sugar} is the LHS and {Flour} is the RHS. This association can be used to move from k-itemsets to (k+1)-itemsets. For example, transactions including {Sugar, Flour} have a high chance of also including Baking Soda.
In this way, the Apriori algorithm uses frequent k-itemsets to search for (k+1)-itemsets. Frequent 1-itemsets are used to find frequent 2-itemsets, those are used to find 3-itemsets, and so on until no more frequent itemsets can be found.
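The level-wise search described above can be sketched in a few lines of Python. The article itself works in R, so this is only an illustration and not the arules implementation. The five transactions are hypothetical, chosen to agree with the rules of the Cake example (Flour and Egg in every basket, Sugar in 3 of 5). The function name apriori_frequent_itemsets and the 0.6 support threshold are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical cake-style baskets (not the article's original table).
transactions = [
    {"Flour", "Egg", "Sugar"},
    {"Flour", "Egg", "Sugar", "Baking Soda"},
    {"Flour", "Egg", "Sugar", "Butter"},
    {"Flour", "Egg"},
    {"Flour", "Egg", "Butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise search: frequent k-itemsets seed the (k+1)-candidates."""
    items = sorted({item for t in transactions for item in t})
    # Level 1: frequent single items.
    current = [frozenset([i]) for i in items
               if support(frozenset([i]), transactions) >= min_support]
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset, transactions)
        current_set = set(current)
        # Join step: unite pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step (Apriori property): every k-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current_set
                             for s in combinations(c, k))}
        current = [c for c in candidates
                   if support(c, transactions) >= min_support]
        k += 1
    return frequent


frequent = apriori_frequent_itemsets(transactions, min_support=0.6)
for itemset, s in sorted(frequent.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

With this toy data, Butter and Baking Soda are dropped at the first level, so no larger candidate containing them is ever counted. That pruning is what keeps the search space small.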
1.1 Dataset Handling
Let us analyze the “Groceries” data in R. It ships with the arules package, where transactions are stored in a dedicated class called “transactions”.
## Loading required package: arules
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
1.2 Overview
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:43367] 13 60 69 78 14 29 98 24 15 29 ...
## .. .. ..@ p : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
## .. .. ..@ Dim : int [1:2] 169 9835
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 169 obs. of 3 variables:
## .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
## .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
## .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
## ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
The transactions object is internally divided into 3 slots: data, itemInfo, and itemsetInfo. The data slot is a sparse matrix with 169 items and 9,835 transactions; it holds the dimensions, the dimension names, and the indices of the products purchased in each transaction.
##               labels     level2               level1
## 1        frankfurter    sausage     meat and sausage
## 2            sausage    sausage     meat and sausage
## 3         liver loaf    sausage     meat and sausage
## 4                ham    sausage     meat and sausage
## 5               meat    sausage     meat and sausage
## 6  finished products    sausage     meat and sausage
## 7    organic sausage    sausage     meat and sausage
## 8            chicken    poultry     meat and sausage
## 9             turkey    poultry     meat and sausage
## 10              pork       pork     meat and sausage
## 11              beef       beef     meat and sausage
## 12    hamburger meat       beef     meat and sausage
## 13              fish       fish     meat and sausage
## 14      citrus fruit      fruit fruit and vegetables
## 15    tropical fruit      fruit fruit and vegetables
## 16         pip fruit      fruit fruit and vegetables
## 17            grapes      fruit fruit and vegetables
## 18           berries      fruit fruit and vegetables
## 19       nuts/prunes      fruit fruit and vegetables
## 20   root vegetables vegetables fruit and vegetables
The first 20 rows of the “itemInfo” slot give the names of the items under the column “labels”. The “level1” column generalizes the items into broad groups and “level2” categorizes them into more specific domains, which helps in finding associations efficiently.
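The i and p vectors printed in the data slot can be decoded by hand. The Python sketch below assumes the usual compressed sparse column layout of an ngCMatrix: items are rows, transactions are columns, i holds 0-based item indices, and p holds the offset at which each transaction starts in i. Only the first ten values of each vector are available from the output, and only the 20 labels printed above are known, so other items are shown by index. The helper names transaction_items and label_for are made up for this example.

```python
# First ten entries of the @i and @p vectors, copied from the str() output.
i = [13, 60, 69, 78, 14, 29, 98, 24, 15, 29]  # 0-based item (row) indices
p = [0, 4, 7, 8, 12, 16, 21, 22, 27, 28]       # start offset of each transaction in i

# The 20 labels printed from itemInfo; position k here is row k + 1 in the table.
labels = [
    "frankfurter", "sausage", "liver loaf", "ham", "meat",
    "finished products", "organic sausage", "chicken", "turkey", "pork",
    "beef", "hamburger meat", "fish", "citrus fruit", "tropical fruit",
    "pip fruit", "grapes", "berries", "nuts/prunes", "root vegetables",
]


def transaction_items(i, p, t):
    """0-based item indices of transaction t (assumed column-compressed layout)."""
    return i[p[t]:p[t + 1]]


def label_for(index):
    """Label if it is among the 20 printed rows, otherwise a placeholder."""
    return labels[index] if index < len(labels) else f"item #{index}"


# Decode the first three transactions; later ones need more of i than is shown.
for t in range(3):
    print(t, [label_for(idx) for idx in transaction_items(i, p, t)])

# Consecutive differences of p give the number of items per transaction.
basket_sizes = [end - start for start, end in zip(p, p[1:])]
print(basket_sizes)
```

Under these assumptions the first transaction holds four items, the first of which is citrus fruit, and the third transaction holds a single item.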
2. Implementing Apriori Algorithm
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
The minimum support parameter (minSup) is set to 0.001, which corresponds to roughly 10 of the 9,835 transactions. The minimum confidence (minConf) is set to 0.8 here, and values between 0.75 and 0.85 can be tried for varied results. Support, confidence, and lift are explained below:
Support:
Support can be understood as the general probability of a particular event occurring. For example, let's assume an event named ‘Buy’, which represents buying a product. In this case, the support of ‘Buy’ is the number of transactions including ‘Buy’ divided by the total number of transactions in the store.
Confidence:
The confidence of a rule is the conditional probability of one event occurring given that another event has already occurred. In general terms, for a rule B => A it is the chance of A happening given that B has already occurred: the support of A and B together divided by the support of B.
Lift:
Lift is the ratio of confidence to expected confidence. Equivalently, it is the probability of all of the items in a rule occurring together, divided by the product of the probabilities of the left-hand side and the right-hand side occurring. The lift value represents the quality of a rule in predicting the association between items. A lift above 1 means the items occur together more often than would be expected if they were independent; the higher the lift, the stronger the association.
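The three measures can be computed directly from transaction counts. The following Python sketch uses five hypothetical baskets and the rule {Flour, Egg} => {Sugar}. The helper names count and rule_metrics are invented for illustration and are not part of arules.

```python
# Hypothetical baskets, invented for this illustration.
transactions = [
    {"Flour", "Egg", "Sugar"},
    {"Flour", "Egg", "Sugar"},
    {"Flour", "Egg", "Sugar"},
    {"Flour", "Egg"},
    {"Milk"},
]


def count(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rule_metrics(lhs, rhs, transactions):
    """Support, confidence, and lift of the rule lhs => rhs.

    Assumes lhs and rhs each occur in at least one transaction.
    """
    n = len(transactions)
    both = count(lhs | rhs, transactions)
    support = both / n                                   # P(lhs and rhs)
    confidence = both / count(lhs, transactions)         # P(rhs | lhs)
    lift = confidence / (count(rhs, transactions) / n)   # confidence / P(rhs)
    return support, confidence, lift


support, confidence, lift = rule_metrics({"Flour", "Egg"}, {"Sugar"}, transactions)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Here the rule holds in 3 of 5 baskets (support 0.60) and in 3 of the 4 baskets that contain Flour and Egg (confidence 0.75). Sugar alone appears in 60% of baskets, so the lift is 0.75 / 0.60 = 1.25.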
The first 20 of the 410 generated rules, with their support, confidence, coverage, lift, and count:
##      lhs                                             rhs                    support confidence    coverage      lift count
## [1]  {liquor, red/blush wine}                     => {bottled beer}     0.001931876  0.9047619 0.002135231 11.235269    19
## [2]  {curd, cereals}                              => {whole milk}       0.001016777  0.9090909 0.001118454  3.557863    10
## [3]  {yogurt, cereals}                            => {whole milk}       0.001728521  0.8095238 0.002135231  3.168192    17
## [4]  {butter, jam}                                => {whole milk}       0.001016777  0.8333333 0.001220132  3.261374    10
## [5]  {soups, bottled beer}                        => {whole milk}       0.001118454  0.9166667 0.001220132  3.587512    11
## [6]  {napkins, house keeping products}            => {whole milk}       0.001321810  0.8125000 0.001626843  3.179840    13
## [7]  {whipped/sour cream, house keeping products} => {whole milk}       0.001220132  0.9230769 0.001321810  3.612599    12
## [8]  {pastry, sweet spreads}                      => {whole milk}       0.001016777  0.9090909 0.001118454  3.557863    10
## [9]  {turkey, curd}                               => {other vegetables} 0.001220132  0.8000000 0.001525165  4.134524    12
## [10] {rice, sugar}                                => {whole milk}       0.001220132  1.0000000 0.001220132  3.913649    12
## [11] {butter, rice}                               => {whole milk}       0.001525165  0.8333333 0.001830198  3.261374    15
## [12] {domestic eggs, rice}                        => {whole milk}       0.001118454  0.8461538 0.001321810  3.311549    11
## [13] {rice, bottled water}                        => {whole milk}       0.001220132  0.9230769 0.001321810  3.612599    12
## [14] {yogurt, rice}                               => {other vegetables} 0.001931876  0.8260870 0.002338587  4.269346    19
## [15] {oil, mustard}                               => {whole milk}       0.001220132  0.8571429 0.001423488  3.354556    12
## [16] {canned fish, hygiene articles}              => {whole milk}       0.001118454  1.0000000 0.001118454  3.913649    11
## [17] {herbs, fruit/vegetable juice}               => {other vegetables} 0.001220132  0.8000000 0.001525165  4.134524    12
## [18] {herbs, shopping bags}                       => {other vegetables} 0.001931876  0.8260870 0.002338587  4.269346    19
## [19] {tropical fruit, herbs}                      => {whole milk}       0.002338587  0.8214286 0.002846975  3.214783    23
## [20] {herbs, rolls/buns}                          => {whole milk}       0.002440264  0.8000000 0.003050330  3.130919    24
The first 20 rules produced from the Groceries data are listed above. The first rule states that when liquor and red/blush wine are bought, bottled beer is likely to be bought as well.
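The columns of the rule table are tied together by simple arithmetic, which can be checked for rule [1]. The Python sketch below uses only numbers printed in the apriori log and in the table. It assumes that coverage is the support of the left-hand side, and it derives the support of {bottled beer} from confidence and lift rather than reading it from the data.

```python
# Values copied from the apriori log and from rule [1] in the table.
n_transactions = 9835       # "9835 transaction(s)" in the log
rule_count = 19             # count column
coverage = 0.002135231      # coverage column, assumed to be the support of the LHS
reported_lift = 11.235269   # lift column

support = rule_count / n_transactions
lhs_count = round(coverage * n_transactions)   # baskets with liquor and red/blush wine
confidence = rule_count / lhs_count
rhs_support = confidence / reported_lift       # implied support of {bottled beer}

print(f"support     = {support:.9f}")
print(f"lhs count   = {lhs_count}")
print(f"confidence  = {confidence:.7f}")
print(f"rhs support = {rhs_support:.4f}")
```

The recomputed support and confidence agree with the table. The implied support of bottled beer is about 0.08, so a confidence of 0.90 is more than eleven times what chance alone would give, and that is why the lift is so high.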
3. Interpretations and Analysis
3.1 The Item Frequency Histogram
The histogram below shows how frequently each item occurs in the dataset compared to other items. The relative frequency plot shows that “Whole Milk” and “Other Vegetables” are the top two most purchased products.
The graph above shows that people buy milk and vegetables more often than other items in the store. Now, let us place related items near milk and vegetables to optimize sales. Bread and eggs can be a great complement.
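The numbers behind a relative item frequency plot are easy to reproduce: count the baskets that contain each item and divide by the number of baskets. The Python sketch below does this on five hypothetical baskets. The function name relative_item_frequency is an assumption for this example, and the real Groceries frequencies are not reproduced here.

```python
from collections import Counter

# Hypothetical baskets, invented for this illustration.
transactions = [
    ["whole milk", "other vegetables", "rolls/buns"],
    ["whole milk", "yogurt"],
    ["other vegetables", "whole milk"],
    ["soda", "other vegetables"],
    ["whole milk", "rolls/buns"],
]


def relative_item_frequency(transactions, top_n):
    """Share of baskets containing each item, for the top_n most frequent items."""
    # set(t) makes sure an item is counted once per basket.
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return [(item, c / n) for item, c in counts.most_common(top_n)]


top = relative_item_frequency(transactions, top_n=2)
for item, freq in top:
    print(f"{item}: {freq:.2f}")
```

In this toy data whole milk appears in 4 of 5 baskets and other vegetables in 3 of 5, so they come out on top, mirroring the ordering seen in the plot.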
3.2 Graphical Representation
The graph below represents the support and lift of multiple items in the inventory and shows the associations among those items. The size of each node is based on its support level and the color is based on its lift ratio.
## Warning: Unknown control parameters: type
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
It is clear that most of the transactions are around Whole Milk. Liquor and wine also show a strong association. Similarly, tropical fruits and herbs are related to rolls and buns. A bit off, but it's okay! These items should be placed in the same aisle.
Each black box represents a non-zero value, which indicates a link between an item and a transaction.
3.3 Interactive Scatterplot
The interactive plot visualizes the association rules as a scatter plot. The x-axis and the y-axis represent support and confidence respectively. Let’s move around the scatter plot and see the results.
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
Moving around the plot displays the lift, support, and confidence for each set of items. A rule like {Liquor, Red/blush wine} => {Bottled beer} has a confidence of about 0.90 and a high lift of 11.2, so it is a suitable set of items to place together.
4. Conclusion
After visualizing the plots above, a more detailed and effective strategy can be implemented to place related items together. The Groceries transactions show a strong association between “Whole Milk” and “Vegetables”, and between “Wine” and “Bottled Beer”. Dedicated aisles give customers a smooth and pleasant shopping experience with easy access to related items. They also act as a catalyst to boost store sales.
Aisles Proposed:
Liquor Aisle – Liquor, Red/Blush Wine, Bottled Beer
Groceries Aisle – Other vegetables, Whole milk, Oil, Yogurt, Rice, Root Vegetable
Fruit Aisle – Citrus Fruit, Grape, Fruit/Vegetable juice, Tropical fruits
Breakfast Aisle – Pastry, Curd, Cereals, Sweet Spreads
Now you know the tricks retailers adopt for your convenient shopping experience and to boost their sales!
Keep shopping, enjoy shopping!!